
Exam 3 Review

Suppose that X_i = x = (x_1, ..., x_k)^T is observed and that Y_i | X_i = x_i are independent Binomial(n_i, π(x_i)) for i = 1, ..., N, where

π̂(x) = exp(α̂ + β̂^T x) / [1 + exp(α̂ + β̂^T x)].

This is called the full model for logistic regression, and the (k + 1) parameters α, β_1, ..., β_k are estimated.

For the saturated model, the Y_i | X_i = x_i are independent Binomial(n_i, π_i) for i = 1, ..., N, where π̂_i = Y_i / n_i. This model estimates the N parameters π_i.

Let l_SAT(π_1, ..., π_N) be the likelihood function for the saturated model and let l_FULL(α, β) be the likelihood function for the full model. Let L_SAT = log l_SAT(π̂_1, ..., π̂_N) be the log likelihood function for the saturated model evaluated at the MLE (π̂_1, ..., π̂_N), and let L_FULL = log l_FULL(α̂, β̂) be the log likelihood function for the full model evaluated at the MLE (α̂, β̂). Then the deviance is

D = G² = −2(L_FULL − L_SAT) = 2(L_SAT − L_FULL).

The degrees of freedom for the deviance = df_FULL = N − k − 1, where N is the number of parameters for the saturated model and k + 1 is the number of parameters for the full model.

The saturated model is usually not very good for binary data (all n_i = 1) or if the n_i are small. The saturated model can be good if all of the n_i are large, or if π_i is very close to 0 or 1 whenever n_i is small.

If X ~ χ²_d, then E(X) = d and V(X) = 2d. An observed value x > d + 3√d is unusually large, and an observed value x < d − 3√d is unusually small. When the saturated model is good, a rule of thumb is that the logistic regression model is OK if G² ≤ N − k − 1 (or if G² ≤ N − k − 1 + 3√(N − k − 1)).

An estimated sufficient summary or ESS plot is a plot of w_i = α̂ + β̂^T x_i versus Y_i, with the logistic curve of fitted proportions π̂(w_i) = e^{w_i}/(1 + e^{w_i}) added to the plot along with a step function of observed proportions.

29) Suppose that w_i takes many values (e.g. the LR model has a continuous predictor) and that k + 1 << N. Know that the LR model is good if the step function tracks the logistic curve of fitted proportions in the ESS plot. Also know that you should check that the LR model is good before doing inference with the LR model. See HW6 4.
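The full-model MLE and the deviance defined above can be sketched numerically. A minimal NumPy illustration (not the course software; `fit_logistic` and `deviance` are made-up helper names) that fits the full model by Newton-Raphson / iteratively reweighted least squares on grouped binomial data:

```python
import numpy as np

def fit_logistic(X, y, n, iters=25):
    """Fit pi(x) = exp(a + b^T x)/(1 + exp(a + b^T x)) by IRLS.
    X is N x k; y = successes; n = trials per case."""
    N, k = X.shape
    Xd = np.column_stack([np.ones(N), X])          # prepend column for alpha
    beta = np.zeros(k + 1)
    for _ in range(iters):
        eta = Xd @ beta
        pi = 1.0 / (1.0 + np.exp(-eta))
        W = n * pi * (1.0 - pi)                    # working weights
        z = eta + (y - n * pi) / W                 # working response
        beta = np.linalg.solve(Xd.T @ (W[:, None] * Xd), Xd.T @ (W * z))
    return beta                                    # (alpha-hat, beta-hats)

def deviance(X, y, n, beta):
    """D = G^2 = -2(L_FULL - L_SAT) for grouped binomial data."""
    eta = np.column_stack([np.ones(len(y)), X]) @ beta
    mu = n / (1.0 + np.exp(-eta))                  # fitted n_i * pi(x_i)
    with np.errstate(divide="ignore", invalid="ignore"):
        t1 = np.where(y > 0, y * np.log(y / mu), 0.0)
        t2 = np.where(n - y > 0, (n - y) * np.log((n - y) / (n - mu)), 0.0)
    return 2.0 * np.sum(t1 + t2)

# tiny check: with N = k + 1 = 2 the full model is saturated, so D should be ~0
X = np.array([[0.0], [1.0]])
y = np.array([3.0, 7.0])
n = np.array([10.0, 10.0])
bhat = fit_logistic(X, y, n)
D = deviance(X, y, n, bhat)
```

On grouped data the rule of thumb above can then be checked directly: the LR model looks OK if D ≤ N − k − 1 + 3√(N − k − 1).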

Response = Y, Terms = (X_1, ..., X_k)

Sequential Analysis of Deviance
                          Total                Change
Predictor    df           Deviance     df      Deviance
Ones         N−1 = df_o       G²_o
X_1          N−2                        1
X_2          N−3                        1
...
X_k          N−k−1 = df_FULL  G²_FULL   1
-----------------------------------------
Data set = cbrain, Name of Fit = B1
Response = sex
Terms = (cephalic size log[size])

Sequential Analysis of Deviance
                   Total              Change
Predictor    df    Deviance     df    Deviance
Ones        266    363.820
cephalic    265    363.605       1     0.214643
size        264    315.793       1    47.8121
log[size]   263    305.045       1    10.7484

Know how to use the above output for the following test. Assume that the ESS plot has been made and that the observed proportions track the logistic curve. If the logistic curve looks like a line with small positive slope, then the predictors may not be useful. The following test asks whether π̂(x_i) from the logistic regression should be used to estimate P(Y_i = 1 | x_i), or whether none of the predictors should be used, so that P(Y_i = 1) = π for all i = 1, ..., N, estimated by π̂ = Σ_{i=1}^N Y_i / Σ_{i=1}^N n_i.

30) The 4 step (log likelihood) deviance test is:
i) H_o: β_1 = ... = β_k = 0   H_A: not H_o.
ii) Test statistic G²(o|F) = G²_o − G²_FULL.
iii) The p-value = P(W > G²(o|F)) where W ~ χ²_k has a chi-square distribution with k degrees of freedom. Note that k = (k + 1) − 1 = df_o − df_FULL = (N − 1) − (N − k − 1).
iv) Reject H_o if the p-value < δ and conclude that there is a LR relationship between Y and the predictors X_1, ..., X_k. If p-value ≥ δ, then fail to reject H_o and conclude that there is not a LR relationship between Y and the predictors X_1, ..., X_k. See HW6 6a.
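The deviance test in 30) can be carried out directly from the sequential analysis of deviance output. A short sketch using the cbrain fit above (decimal points as restored from the garbled output) and scipy:

```python
from scipy.stats import chi2

# cbrain output, Name of Fit = B1: intercept-only and full-model rows
G2_o, df_o = 363.820, 266          # "Ones" row
G2_full, df_full = 305.045, 263    # after cephalic, size, log[size]

G2_change = G2_o - G2_full         # test statistic G^2(o|F) from 30) ii)
k = df_o - df_full                 # chi-square degrees of freedom, here k = 3
pval = chi2.sf(G2_change, k)       # P(W > G^2(o|F)) for W ~ chi^2_k
# the p-value is tiny, so reject Ho: there is a LR relationship
# between sex and the predictors
```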

After obtaining an acceptable full model where

logit(π(x_i)) = α + β_1 x_i1 + ... + β_k x_ik = α + β^T x_i,

try to obtain a reduced model Y_i | X_Ri = x_Ri independent Binomial(n_i, π(x_Ri)) where

logit(π(x_Ri)) = α_R + β_R1 x_Ri1 + ... + β_Rm x_Rim = α_R + β_R^T x_Ri

and {x_Ri1, ..., x_Rim} ⊂ {x_1, ..., x_k}. Let x_R,m+1, ..., x_Rk denote the k − m predictors that are in the full model but not in the reduced model. We want to test H_o: β_R,m+1 = ... = β_Rk = 0. For notational ease, we will often assume that the predictors have been sorted and partitioned so that x_i = x_Ri for i = 1, ..., m. Then the reduced model uses predictors x_1, ..., x_m, and we test H_o: β_{m+1} = ... = β_k = 0. In practice, however, this sorting is usually not done.

Assume that the ESS plot looks good. Then we want to test H_o: the reduced model can be used instead of the full model, versus H_A: the full model is (significantly) better than the reduced model. Fit the full model and the reduced model to get the deviances G²_FULL and G²_RED.

31) The 4 step change in deviance test is:
i) H_o: the reduced model is good   H_A: use the full model.
ii) Test statistic G²(R|F) = G²_RED − G²_FULL.
iii) The p-value = P(W > G²(R|F)) where W ~ χ²_{k−m} has a chi-square distribution with k − m degrees of freedom. Here k is the number of predictors in the full model while m is the number of predictors in the reduced model. Also notice that k − m = (k + 1) − (m + 1) = df_RED − df_FULL = (N − m − 1) − (N − k − 1).
iv) Reject H_o if the p-value < δ and conclude that the full model is (significantly) better than the reduced model. If p-value ≥ δ, then fail to reject H_o and conclude that the reduced model is good. See HW6 6b.

32) If the reduced model leaves out a single variable X_i, then the change in deviance test becomes H_o: β_i = 0 versus H_A: β_i ≠ 0. This likelihood ratio test is a competitor of the Wald test (see 28)). The likelihood ratio test is usually better than the Wald test if the sample size N is not large, but the Wald test is currently easier for software to produce. For large N the test statistics from the two tests tend to be very similar (asymptotically equivalent tests).

33) If the reduced model is good, then the EE plot of α̂_R + β̂_R^T x_Ri versus α̂ + β̂^T x_i should be highly correlated with the identity line with unit slope and zero intercept.

Know how to use the following output to test the reduced model versus the full model.

Response = Y, Terms = (X_1, ..., X_k)  (Full Model)

Label     Estimate  Std Error  Est/SE                     p-value
Constant  α̂         se(α̂)      z_o,0                      for Ho: α = 0
x_1       β̂_1       se(β̂_1)    z_o,1 = β̂_1/se(β̂_1)       for Ho: β_1 = 0
...
x_k       β̂_k       se(β̂_k)    z_o,k = β̂_k/se(β̂_k)       for Ho: β_k = 0

Degrees of freedom: N − k − 1 = df_FULL
Deviance: D = G²_FULL

Response = Y, Terms = (X_1, ..., X_m)  (Reduced Model)

Label     Estimate  Std Error  Est/SE                     p-value
Constant  α̂         se(α̂)      z_o,0                      for Ho: α = 0
x_1       β̂_1       se(β̂_1)    z_o,1 = β̂_1/se(β̂_1)       for Ho: β_1 = 0
...
x_m       β̂_m       se(β̂_m)    z_o,m = β̂_m/se(β̂_m)       for Ho: β_m = 0

Degrees of freedom: N − m − 1 = df_RED
Deviance: D = G²_RED
-----------------------------------------------------------------
Data set = Banknotes, Name of Fit = B1 (Full Model)
Response = Status
Terms = (Diagonal Bottom Top)

Coefficient Estimates
Label     Estimate   Std Error   Est/SE   p-value
Constant   2360.49    5064.42     0.466    0.6411
Diagonal   −198.874    372.830   −0.533    0.5937
Bottom      236.950    455.271    0.520    0.6027
Top         196.464    606.512    0.324    0.7460

Degrees of freedom: 196
Deviance: 0.009

Data set = Banknotes, Name of Fit = B2 (Reduced Model)
Response = Status
Terms = (Diagonal)

Coefficient Estimates
Label     Estimate   Std Error   Est/SE   p-value
Constant   989.545    219.032     4.518    0.0000
Diagonal   −7.04376    1.55940   −4.517    0.0000

Degrees of freedom: 198
Deviance: 21.109
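The change in deviance test 31) applied to the Banknotes output (degrees of freedom and deviances as shown above, with decimal points restored from the garbled output) can be sketched as:

```python
from scipy.stats import chi2

# Banknotes output: full model B1 and reduced model B2 (Diagonal only)
G2_full, df_full = 0.009, 196
G2_red, df_red = 21.109, 198

G2_change = G2_red - G2_full       # G^2(R|F) from 31) ii)
df = df_red - df_full              # k - m = 2 degrees of freedom
pval = chi2.sf(G2_change, df)
# pval ~ 2.6e-5 < 0.05: reject Ho, the full model is significantly better.
# But note point 53): the near-zero full-model deviance suggests perfect
# classification, so the full-model output itself is suspect.
```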

34) Let π(x) = P(success | x) = 1 − P(failure | x), where a success is what is counted and a failure is what is not counted (so if the Y_i are binary, π(x) = P(Y_i = 1 | x)). Then the estimated odds of success is Ω̂(x) = π̂(x)/(1 − π̂(x)).

35) In logistic regression, increasing a predictor x_i by 1 unit (while holding all other predictors fixed) multiplies the estimated odds of success by a factor of exp(β̂_i).

36) Suppose that the binary response variable Y is conditionally independent of x given a single linear combination β^T x of the predictors, written Y ⊥ x | β^T x. If the LR model holds and if the first SIR predictor β̂_SIR1^T x and α̂ + β̂^T x are highly correlated, then (to a good approximation) Y ⊥ x | α̂ + β̂^T x and Y ⊥ x | β̂_SIR1^T x. To make a binary response plot for logistic regression, fit SIR and the LR model and assume that the above conditions hold. Place the first SIR predictor on the horizontal axis and the 2nd SIR predictor β̂_SIR2^T x on the vertical axis. If Y = 0 use symbol 0, and if Y = 1 use symbol X. If the LR model is good, then consider the symbol density of X's and 0's in a narrow vertical slice. This symbol density should be approximately constant (up to binomial variation) from the bottom to the top of the slice (hence the X's and 0's should be mixed in the slice). The symbol density may change greatly as the slice is moved from the left to the right of the plot, e.g. from 0% to 100%. If there are slices where the symbol density is not constant from top to bottom, then the LR model may not be good (e.g. a more complicated model may be needed).

37) Given a predictor x, sometimes x is not used by itself in the full LR model. Suppose that Y is binary. Then to decide what functions of x should be in the model, look at the conditional distribution of x | Y = i for i = 0, 1. The following rules are used if x is an indicator variable or if x is a continuous variable.

distribution of x | Y = i               functions of x to include in the full LR model
x | Y = i is an indicator               x
x | Y = i ~ N(µ_i, σ²)                  x
x | Y = i ~ N(µ_i, σ_i²)                x and x²
x | Y = i has a skewed distribution     x and log(x)
x | Y = i has support on (0, 1)         log(x) and log(1 − x)

38) If w is a nominal variable with J levels, use J − 1 (indicator or) dummy variables x_1,w, ..., x_J−1,w in the full model.

39) An interaction is a product of two or more predictor variables. Interactions are difficult to interpret. Often interactions are included in the full model, and the reduced model without any interactions is tested. The investigator is hoping that the interactions are not needed.
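The odds multiplier in 35) can be verified numerically. A tiny sketch in which the coefficient values (α̂ = −1, β̂ = (0.4, −0.2)) are made up for illustration:

```python
import math

def pihat(alpha, beta, x):
    """Fitted probability exp(w)/(1 + exp(w)) with w = alpha + beta^T x."""
    w = alpha + sum(b * xi for b, xi in zip(beta, x))
    return math.exp(w) / (1.0 + math.exp(w))

def odds(p):
    """Estimated odds of success, point 34)."""
    return p / (1.0 - p)

alpha, beta = -1.0, [0.4, -0.2]       # hypothetical fitted coefficients
x = [2.0, 1.0]
x_bumped = [3.0, 1.0]                 # x_1 increased by one unit
ratio = odds(pihat(alpha, beta, x_bumped)) / odds(pihat(alpha, beta, x))
# ratio equals exp(beta_1) = exp(0.4), the multiplier in point 35)
```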

40) A scatterplot of x vs Y is used to visualize the conditional distribution of Y | x. A scatterplot matrix is an array of scatterplots; it is used to examine the marginal relationships of the predictors and response. Place Y on the top or bottom of the scatterplot matrix, and also mark the plotted points by a 0 if Y = 0 and by X if Y = 1. Variables with outliers, missing values, or strong nonlinearities may be so bad that they should not be included in the full model.

41) Suppose that all values of the variable x are positive. The log rule says add log(x) to the full model if max(x_i)/min(x_i) > 10.

42) To make a full model, use points 37), 38), 40) and 41), and sometimes 39). The number of predictors in the full model should be much smaller than the number of data cases N. Make an ESS plot to check that the full model is good.

43) Variable selection is closely related to the change in deviance test for a reduced model. You are seeking a subset I of the variables to keep in the model. The AIC(I) statistic is used as an aid in backward elimination and forward selection. The full model and the model with the smallest AIC are always of interest. Create a full model; the full model has a deviance at least as small as that of any submodel.

44) Backward elimination starts with the full model with k variables, and the predictor that optimizes some criterion is deleted. Then there are k − 1 variables left, and the predictor that optimizes some criterion is deleted. This process continues for models with k − 2, k − 3, ..., 3 and 2 predictors. Forward selection starts with the model with 0 variables, and the predictor that optimizes some criterion is added. Then there is 1 variable in the model, and the predictor that optimizes some criterion is added. This process continues for models with 2, 3, ..., k − 2 and k − 1 predictors. Both forward selection and backward elimination result in a sequence of k models {x_1}, {x_1, x_2}, ..., {x_1, x_2, ..., x_{k−1}}, {x_1, x_2, ..., x_k} = full model.

45) For logistic regression, suppose that the Y_i are binary for i = 1, ..., N. Let N_1 = Σ Y_i = the number of 1's and N_0 = N − N_1 = the number of 0's. Rule of thumb: the final submodel should have m predictor variables where m is small, with m ≤ min(N_1, N_0)/10.

46) Know how to find good models from output. A good submodel I will use a small number of predictors, have a good ESS plot, and have a good EE plot. A good LR submodel I should have a deviance G²(I) close to that of the full model, in that the change in deviance test 31) would not be rejected. Also, the submodel should have a value of AIC(I) close to that of the examined model that has the minimum AIC value. The LR output for model I should not have many variables with large Wald test p-values.

47) Heuristically, backward elimination tries to delete the variable that will increase the deviance the least. An increase in deviance greater than 4 (if the predictor has 1 degree of freedom) may be troubling, in that a good predictor may have been deleted. In practice, the backward elimination program may delete the variable such that the submodel I with j predictors has 1) the smallest AIC(I), 2) the smallest deviance G²(I), or 3) the biggest p-value (preferably from a change in deviance test but possibly from a Wald test) in the test H_o: β_i = 0 versus H_A: β_i ≠ 0, where the current model with j + 1 variables is treated as the full model.

48) Heuristically, forward selection tries to add the variable that will decrease the deviance the most. A decrease in deviance of less than 4 (if the predictor has 1 degree of freedom) may be troubling, in that a bad predictor may have been added. In practice, the forward selection program may add the variable such that the submodel I with j predictors has 1) the smallest AIC(I), 2) the smallest deviance G²(I), or 3) the smallest p-value (preferably from a change in deviance test but possibly from a Wald test) in the test H_o: β_i = 0 versus H_A: β_i ≠ 0, where the current model with j terms plus the predictor x_i is treated as the full model (for all variables x_i not yet in the model).

49) For logistic regression, let N_1 = number of ones and N_0 = N − N_1 = number of zeroes. A rough rule of thumb is that the full model should use no more than min(N_0, N_1)/5 predictors and the final submodel should use no more than min(N_0, N_1)/10 predictors.

50) For loglinear regression, a rough rule of thumb is that the full model should use no more than N/5 predictors and the final submodel should use no more than N/10 predictors.

51) Variable selection is pretty much the same for logistic regression and loglinear regression. Suppose that the full model is good and is stored in M1. Let M2, M3, M4, and M5 be candidate submodels found after forward selection, backward elimination, etc. Make a scatterplot matrix of M2, M3, M4, M5 and M1. Good candidates should have estimated linear predictors that are highly correlated with the full model estimated linear predictor (the correlation should be at least 0.9, and preferably greater than 0.95). For binary logistic regression, mark the symbols using the response variable Y. See HW7 1, HW8 1, HW9 1 and HW10 1.

52) The final submodel I should have few predictors, few variables with large Wald p-values (0.01 to 0.05 is borderline), a good ESS plot, and an EE plot that clusters tightly about the identity line. Do not use more predictors than the minimum AIC model I_min, and want AIC(I) ≤ AIC(I_min) + 7. For the change in deviance test, want p-value ≥ 0.01 for variable selection (instead of δ = 0.05). If a factor has J − 1 dummy variables, either keep all J − 1 dummy variables or delete all J − 1 dummy variables; do not delete some of the dummy variables.

53) Know that when there is perfect classification in the binary logistic regression model, the LR MLE does not exist and the output is suspect. However, often the full model deviance is close to 0 and the deviance test correctly rejects H_o.

54) Suppose that X_i = x = (x_1, ..., x_k)^T is observed and that Y_i | X_i = x_i are independent Poisson(µ(x_i)) for i = 1, ..., N, where µ̂(x) = exp(α̂ + β̂^T x). This is called the full model for loglinear regression, and the (k + 1) parameters α, β_1, ..., β_k are estimated. Know how to predict µ̂(x). Also Ŷ = µ̂(x). See HW9 2, Q8.
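The rules of thumb in 41), 45) and 49) are mechanical enough to encode. A small sketch (the function names `log_rule` and `predictor_budget` are made up for illustration):

```python
def log_rule(x):
    """Point 41): suggest adding log(x) to the full model when
    max(x)/min(x) > 10 (all values of x assumed positive)."""
    return max(x) / min(x) > 10

def predictor_budget(N1, N0):
    """Points 45) and 49): rough caps on the number of predictors for
    binary logistic regression, given N1 ones and N0 zeroes."""
    m = min(N1, N0)
    return {"full_model_max": m // 5, "final_submodel_max": m // 10}
```

For example, with N_1 = 40 ones and N_0 = 200 zeroes the full model should use at most 8 predictors and the final submodel at most 4.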

For the saturated model, the Y_i | X_i = x_i are independent Poisson(µ_i) for i = 1, ..., N, where µ̂_i = Y_i. This model estimates the N parameters µ_i. The saturated model is usually bad; an exception is when all N of the Y_i are large. The comments on the deviance given above for the logistic regression full model still hold.

55) An estimated sufficient summary or EY plot is a plot of w_i = α̂ + β̂^T x_i versus Y_i, with the exponential curve of estimated means µ̂(w_i) = e^{w_i} added to the plot along with a lowess curve.

56) Suppose that w_i takes many values (e.g. the LLR model has a continuous predictor) and that k + 1 << N. Know that the LLR model is good if the lowess curve tracks the exponential curve of estimated means in the EY plot. Also know that you should check that the LLR model is good before doing inference with the LLR model. See HW9 2.

57) Know how to perform the 4 step deviance test. This test is almost exactly the same as that in 30), but replace LR by LLR in the conclusion. The output looks almost like the sequential analysis of deviance output shown above. See HW9 2, Q8. The deviance test for LLR asks whether µ̂(x_i) from LLR should be used to estimate µ(x_i), or whether none of the predictors should be used, so µ̂ = Ȳ = Σ_{i=1}^N Y_i / N.

58) Know how to perform the 4 step Wald test. This test is the same as 28), except replace LR by LLR.

59) Know that a (Wald) 95% CI for β_i is β̂_i ± 1.96 SE(β̂_i).

60) Know how to perform the 4 step change in deviance test. The output is almost the same as that shown above for logistic regression, and the test is exactly the same as that given in 31). For H_o, the parameters set to 0 are those that are in the full model but not in the reduced model.

61) Know what a lurking variable is.

62) Know the difference between an observational study and an experiment. A clinical trial is a randomized controlled experiment performed on humans.

Exam 3 is on Wednesday, April 19 and covers Agresti material, including points 23) through 28) on the Exam 2 review (7 pages of notes). You should know how to use a random number table to draw a simple random sample in order to divide units into 2 groups. In Agresti, we have covered chapters 1, 2.1, 2.2, 2.3, 2.4, 5.1, 5.2, 5.3, 5.4, and 5.5, but have skipped subsections 2.4.5, 2.4.6, 2.4.7, 5.3.3, 5.3.4, and 5.5.6.
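The LLR deviance behind the tests in 57) and 60) has a simple closed form, D = 2 Σ [Y_i log(Y_i/µ̂_i) − (Y_i − µ̂_i)] with 0·log 0 = 0. A short sketch (the count vector is made up for illustration) that evaluates it for the null model of 57), where every µ̂_i = Ȳ:

```python
import numpy as np

def poisson_deviance(y, mu):
    """G^2 = 2 * sum[ y*log(y/mu) - (y - mu) ], with 0*log(0) taken as 0."""
    y, mu = np.asarray(y, float), np.asarray(mu, float)
    term = np.where(y > 0, y * np.log(y / mu), 0.0)
    return 2.0 * np.sum(term - (y - mu))

# null model for the deviance test in 57): mu-hat_i = Ybar for every case
y = np.array([2, 5, 1, 7, 4, 9, 3, 6])
G2_null = poisson_deviance(y, np.full(len(y), y.mean()))
# for the saturated model mu-hat_i = Y_i, so the deviance is exactly 0
```

For the null model the second term vanishes, since Σ(Y_i − Ȳ) = 0, leaving D = 2 Σ Y_i log(Y_i/Ȳ).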